Transport Layer(传输层)

Transport services(运输层服务)

Transport services and protocols

  • provide logical communication(逻辑通信) between app processes running on different hosts

  • transport protocols run in end systems

    • send side: breaks app messages into segments, passes to network layer
    • rcv side: reassembles segments(报文段) into messages, passes to app layer
  • more than one transport protocol available to apps

    • Internet: TCP and UDP

Transport vs. network layer(运输层和网络层的关系)

  • network layer: logical communication between hosts

  • transport layer: logical communication between processes

    • relies on, enhances, network layer services

e.g.:
12 kids in Ann’s house sending letters to 12 kids in Bill’s house

  • hosts = houses
  • processes = kids
  • app messages = letters in envelopes
  • transport protocol = Ann and Bill who demux to in-house siblings
  • network-layer protocol = postal service

Internet transport-layer protocols(因特网运输层协议)

  • reliable, in-order delivery (TCP)
    • congestion control
    • flow control
    • connection setup
  • unreliable, unordered delivery: UDP

    • no-frills extension of “best-effort” IP(尽力而为)
  • services not available:

    • delay guarantees
    • bandwidth guarantees

multiplexing and demultiplexing(多路复用与多路分解)

multiplexing at sender: handle data from multiple sockets, add transport header (later used for demultiplexing)

demultiplexing at receiver: use header info to deliver received segments to correct socket

How demultiplexing works

host receives IP datagrams

  • each datagram has source IP address, destination IP address
  • each datagram carries one transport-layer segment
  • each segment has source, destination port number

TCP_UDP_segment_format.PNG

host uses IP addresses & port numbers to direct segment to appropriate socket

Connectionless demultiplexing(无连接的多路分解)

  • created socket has host-local port #:

    DatagramSocket mySocket1 = new DatagramSocket(12534);
  • when creating datagram to send into UDP socket, must specify

    • destination IP address
    • destination port #
  • when host receives UDP segment:

    • checks destination port # in segment
    • directs UDP segment to socket with that port #

IP datagrams with same dest. port #, but different source IP addresses and/or source port numbers will be directed to same socket at dest
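A minimal Java sketch of connectionless demultiplexing (the class and buffer names are illustrative, not from the source): one DatagramSocket bound to port 12534 receives datagrams from any source IP/port, since only the destination port selects the socket.

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;

// Minimal sketch: a UDP receiver bound to local port 12534.
// Datagrams from ANY source IP/port that target this destination port are
// delivered to this one socket; the source address is only visible in the
// received packet, it plays no role in choosing the socket.
public class UdpDemuxReceiver {
    public static void main(String[] args) throws Exception {
        DatagramSocket mySocket1 = new DatagramSocket(12534); // host-local port #
        byte[] buf = new byte[2048];
        while (true) {
            DatagramPacket pkt = new DatagramPacket(buf, buf.length);
            mySocket1.receive(pkt); // blocks until a segment with dest port 12534 arrives
            System.out.println("from " + pkt.getAddress() + ":" + pkt.getPort()
                    + " -> " + new String(pkt.getData(), 0, pkt.getLength()));
        }
    }
}
```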

Connection-oriented demux(面向连接的多路分解)

TCP socket identified by 4-tuple:

  • source IP address
  • source port number
  • dest IP address
  • dest port number

demux: receiver uses all four values to direct segment to appropriate socket

server host may support many simultaneous TCP sockets:

  • each socket identified by its own 4-tuple

  • web servers have different sockets for each connecting client

    • non-persistent HTTP will have different socket for each request
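A minimal Java sketch of connection-oriented demultiplexing (the port number 6789 and class name are illustrative): every accept() yields a separate connection socket, and arriving segments are steered to it by the full 4-tuple.

```java
import java.net.ServerSocket;
import java.net.Socket;

// Minimal sketch: a TCP server listening on port 6789.
// Every accept() returns a NEW connection socket; segments are demultiplexed
// to that socket by the full 4-tuple (src IP, src port, dest IP, dest port),
// so many clients can connect to the same server port simultaneously.
public class TcpDemuxServer {
    public static void main(String[] args) throws Exception {
        ServerSocket welcomeSocket = new ServerSocket(6789);
        while (true) {
            Socket connectionSocket = welcomeSocket.accept(); // one socket per client
            System.out.println("new connection: "
                    + connectionSocket.getInetAddress() + ":" + connectionSocket.getPort()
                    + " -> local port " + connectionSocket.getLocalPort());
            connectionSocket.close(); // a real server would hand this off to a worker
        }
    }
}
```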

connectionless transport: UDP(无连接运输: UDP)

UDP: User Datagram Protocol [RFC 768]

  • finer application-level control over what data is sent, and when
  • no connection establishment needed
  • no connection state
  • small packet header overhead

  • “no frills,” “bare bones” Internet transport protocol

  • “best effort” service, UDP segments may be:
    • lost
    • delivered out-of-order to app
  • connectionless:

    • no handshaking between UDP sender, receiver
    • each UDP segment handled independently of others
  • UDP use:

    • streaming multimedia apps (loss tolerant, rate sensitive)
    • DNS
    • SNMP
  • reliable transfer over UDP:

    • add reliability at application layer
    • application-specific error recovery

UDP: segment header(UDP报文段首部)

UDP_segment_format.PNG

length: in bytes of UDP segment, including header

  • why is there a UDP?
    • no connection establishment (which can add delay)
    • simple: no connection state at sender, receiver
    • small header size
    • no congestion control: UDP can blast away as fast as desired

UDP checksum(UDP检验和)

  • end-end principle(端到端原则)

  • Goal: detect “errors” (e.g., flipped bits) in transmitted segment

  • sender:

    • treat segment contents, including header fields, as sequence of 16-bit integers
    • checksum: addition (one’s complement sum) of segment contents
    • sender puts checksum value into UDP checksum field
  • receiver:

    • compute checksum of received segment
    • check if computed checksum equals checksum field value:
      • NO - error detected
      • YES - no error detected. But maybe errors nonetheless? More later ….

e.g.: add two 16-bit integers

UDP_check.PNG

Note: when adding numbers, a carryout from the most significant bit needs to be added to the result

UDP Pseudo-Header(UDP伪首部)

UDP_pseudo_header.PNG

  • Protocol – 17 (UDP)

e.g. Checksum calculation of a simple UDP user datagram

UDP_checkSum_calcute.png

  • All 0s: padding so the data is a multiple of 16 bits (an extra zero byte is added if the data length is odd)
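A small Java sketch of the one's-complement checksum procedure above, run over a few made-up 16-bit words (in a real UDP checksum the words would cover the pseudo-header, the UDP header with the checksum field set to 0, and the padded data):

```java
// Sketch of the one's-complement checksum described above.
// Words are treated as unsigned 16-bit integers; any carry out of the
// high bit is wrapped around and added back in, and the final sum is inverted.
public class UdpChecksum {
    static int onesComplementChecksum(int[] words16) {
        int sum = 0;
        for (int w : words16) {
            sum += (w & 0xFFFF);
            if ((sum & 0x10000) != 0) {      // carry out of bit 15?
                sum = (sum & 0xFFFF) + 1;    // wrap it around
            }
        }
        return ~sum & 0xFFFF;                // one's complement of the sum
    }

    public static void main(String[] args) {
        // made-up 16-bit words standing in for pseudo-header + UDP segment
        int[] words = {0x4500, 0x0030, 0x4422, 0x0011};
        int checksum = onesComplementChecksum(words);
        System.out.printf("checksum = 0x%04X%n", checksum);
        // receiver check: one's-complement sum of all words plus the checksum is 0xFFFF
    }
}
```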

principles of reliable data transfer(可靠数据传输原理)

important in application, transport, link layers: on the top-10 list of important networking topics

  • characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)

  • rdt: reliable data transfer

reliable_data_transfer.png

  • rdt_send(): called from above (e.g., by app.). Passes data to be delivered to the receiver's upper layer

  • udt_send(): called by rdt, to transfer packet over unreliable channel to receiver

  • rdt_rcv(): called when packet arrives on rcv-side of channel

  • deliver_data(): called by rdt to deliver data to upper layer

rdt1.0: reliable transfer over a reliable channel(经完全可靠信道的可靠数据传输)

  • incrementally develop sender, receiver sides of reliable data transfer protocol (rdt)

  • consider only unidirectional data transfer

    • but control info will flow in both directions!
  • use finite state machines (FSM) to specify sender, receiver

  • underlying channel perfectly reliable

    • no bit errors
    • no loss of packets
  • separate FSMs for sender, receiver:

    • sender sends data into underlying channel
    • receiver reads data from underlying channel

rdt1.0.PNG

rdt2.0: channel with bit errors(经具有比特差错信道的可靠数据传输)

  • underlying channel may flip bits in packet
    • checksum to detect bit errors
  • the question: how to recover from errors:

    • acknowledgements(ACKs, 肯定确认): receiver explicitly tells sender that pkt received OK
    • negative acknowledgements(NAKs, 否定确认): receiver explicitly tells sender that pkt had errors
    • sender retransmits(重传) pkt on receipt of NAK
  • Automatic Repeat reQuest (ARQ, 自动重传请求) protocols

  • new mechanisms in rdt2.0 (beyond rdt1.0):

    • error detection(差错检测)
    • receiver feedback(接收方反馈): control msgs (ACK,NAK) rcvr->sender
  • stop-and-wait(停等) protocol

rdt2.0.PNG

rdt2.0 has a fatal flaw

  • what happens if ACK/NAK corrupted
    • sender doesn’t know what happened at receiver
    • can’t just retransmit: possible duplicate
  • duplicate packet(冗余分组)

  • handling duplicates:

    • sender retransmits current pkt if ACK/NAK corrupted
    • sender adds sequence number to each pkt
    • receiver discards (doesn’t deliver up) duplicate pkt
  • sender sends one packet, then waits for receiver response

rdt2.1: sender, handles garbled(含糊不清的) ACK/NAKs

rdt2.1.PNG

  • sender:
    • seq # added to pkt
    • two seq. #’s (0,1) will suffice.
    • must check if received ACK/NAK corrupted
    • twice as many states
      • state must “remember” whether “expected” pkt should have seq # of 0 or 1
  • receiver:

    • must check if received packet is duplicate
    • state indicates whether 0 or 1 is expected pkt seq #
    • note: receiver can not know if its last ACK/NAK received OK at sender

rdt2.2: a NAK-free protocol

  • same functionality as rdt2.1, using ACKs only
  • instead of NAK, receiver sends ACK for last pkt received OK
    • receiver must explicitly include seq # of pkt being ACKed
  • duplicate ACK at sender results in same action as NAK: retransmit current pkt

rdt2.2.PNG

rdt3.0: channels with errors and loss(经具有比特差错的丢包信道的可靠数据传输)

new assumption: underlying channel can also lose packets (data, ACKs)

  • checksum, seq. #, ACKs, retransmissions will be of help … but not enough

approach: sender waits “reasonable” amount of time for ACK

  • retransmits if no ACK received in this time
  • if pkt (or ACK) just delayed (not lost):
    • retransmission will be duplicate, but seq. #'s already handle this
    • receiver must specify seq # of pkt being ACKed
  • requires countdown timer

rdt3.0_sender.png

rdt3.0 in action

rdt3.0_no_loss.png

rdt3.0_packet_loss.png

rdt3.0_ack_loss.png

rdt3.0_premature_timeout_delayed_ack.png

Performance of rdt3.0

rdt3.0: stop-and-wait operation(停等)

rdt3.0_stop_and_wait.png

  • rdt3.0 is correct, but performance stinks
  • e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:
  • $D_{trans} = \frac{L}{R} = \frac{8000\ \text{bits}}{10^9\ \text{bits/sec}} = 8\ \text{microsecs}$
  • RTT = 30ms

  • $U_{sender}$: utilization – fraction of time sender busy sending

  • $ U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027 $
  • 33kB/sec thruput over 1 Gbps link
  • network protocol limits use of physical resources

Pipelined protocols(流水线可靠数据传输协议)

  • pipelining: sender allows multiple, “in-flight”, yet-to-be-acknowledged pkts
    • range of sequence numbers must be increased
    • buffering at sender and/or receiver
  • two generic forms of pipelined protocols: go-Back-N, selective repeat

pipeline_increasd_utilization.png

  • 3-packet pipelining increases utilization(利用率) by a factor of 3
  • $ U_{sender} = \frac{3L/R}{RTT + L/R} = \frac{0.024}{30.008} = 0.0008 $
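A quick Java calculation reproducing the utilization figures above (1 Gbps link, 15 ms one-way propagation delay, 8000-bit packets are the stated assumptions):

```java
// Reproduces the utilization numbers above: stop-and-wait vs. 3-packet pipelining
// on a 1 Gbps link with 15 ms one-way propagation delay and 8000-bit packets.
public class Utilization {
    public static void main(String[] args) {
        double L = 8000;          // packet size, bits
        double R = 1e9;           // link rate, bits/sec
        double rtt = 0.030;       // round-trip time, seconds (2 x 15 ms)
        double dTrans = L / R;    // transmission delay = 8 microseconds

        double uStopAndWait = dTrans / (rtt + dTrans);
        double uPipelined3  = 3 * dTrans / (rtt + dTrans);

        System.out.printf("U (stop-and-wait) = %.5f%n", uStopAndWait);  // ~0.00027
        System.out.printf("U (3 in flight)   = %.5f%n", uPipelined3);   // ~0.00080
    }
}
```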

  • Go-back-N(GBN, 回退N步):

    • sender can have up to N unacked packets in pipeline
    • receiver only sends cumulative ack
      • doesn’t ack packet if there’s a gap
    • sender has timer for oldest unacked packet
      • when timer expires, retransmit all unacked packets
  • Selective Repeat(SR, 选择重传):

    • sender can have up to N unack’ed packets in pipeline
    • receiver sends individual ack for each packet
    • sender maintains timer for each unacked packet
      • when timer expires, retransmit only that unacked packet

Go-Back-N

Sender

  • k-bit seq # in pkt header
  • “window” of up to N, consecutive unack’ed pkts allowed
  • ACK(n): ACKs all pkts up to, including seq # n - “cumulative ACK”
    • may receive duplicate ACKs (see receiver)
  • timer for oldest in-flight pkt
  • timeout(n): retransmit packet n and all higher seq # pkts in window

goback_N.png

  • window size N(窗口长度)
  • sliding-window protocol(滑动窗口协议)
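A hedged Java sketch of the GBN sender bookkeeping just described; the class, the packet buffer, and the timer stubs are illustrative, not an actual implementation:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the GBN sender rules above (transmission, timer and checksum are
// stubbed out; all names are illustrative).
public class GbnSender {
    private final int N;                 // window size
    private int base = 0;                // oldest unACKed seq #
    private int nextSeqNum = 0;          // next seq # to use
    private final List<String> sentPkts = new ArrayList<>(); // pkt buffer, indexed by seq #

    GbnSender(int windowSize) { this.N = windowSize; }

    // rdt_send(): accept data only if the window is not full
    boolean send(String data) {
        if (nextSeqNum >= base + N) return false;   // window full, refuse data
        sentPkts.add(data);
        udtSend(nextSeqNum);                        // send pkt nextSeqNum
        if (base == nextSeqNum) startTimer();       // timer runs for oldest unACKed pkt
        nextSeqNum++;
        return true;
    }

    // cumulative ACK(n): everything up to and including n is acknowledged
    void onAck(int n) {
        base = n + 1;
        if (base == nextSeqNum) stopTimer();        // nothing left in flight
        else startTimer();                          // restart for the new oldest pkt
    }

    // timeout: go back N – retransmit every pkt from base up to nextSeqNum-1
    void onTimeout() {
        startTimer();
        for (int seq = base; seq < nextSeqNum; seq++) udtSend(seq);
    }

    private void udtSend(int seq) { System.out.println("send pkt " + seq + ": " + sentPkts.get(seq)); }
    private void startTimer() { /* start/restart countdown timer (stub) */ }
    private void stopTimer()  { /* stop timer (stub) */ }
}
```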

sender extended FSM

GBN_sender_FSM.png

receiver extended FSM

GBN_receiver_FSM.png

  • ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
    • may generate duplicate ACKs
    • need only remember expectedseqnum
  • out-of-order pkt:
    • discard (don’t buffer): no receiver buffering
    • re-ACK pkt with highest in-order seq #

GBN in action

GBN_in_action.png

  • cumulative acknowledgment(累积确认)

Selective repeat

  • receiver individually acknowledges all correctly received pkts
    • buffers pkts, as needed, for eventual in-order delivery to upper layer
  • sender only resends pkts for which ACK not received
    • sender timer for each unACKed pkt
  • sender window
    • N consecutive seq #’s
    • limits seq #s of sent, unACKed pkts

sender, receiver windows:

selective_repeat_windows.png

  • sender
    • data from above:
      • if next available seq # in window, send pkt
    • timeout(n):
      • resend pkt n, restart timer
    • ACK(n) in [sendbase,sendbase+N]:
      • mark pkt n as received
      • if n smallest unACKed pkt, advance window base to next unACKed seq #
  • receiver

    • pkt n in [rcvbase, rcvbase+N-1]
      • send ACK(n)
      • out-of-order: buffer
      • in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt
    • pkt n in [rcvbase-N,rcvbase-1]
      • ACK(n)
    • otherwise: ignore
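A hedged Java sketch of the SR receiver rules above (buffer out-of-order packets, ACK everything in or just below the window, deliver in order); names and the delivery stub are illustrative:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the SR receiver rules listed above: individually ACK each packet
// in the window, buffer out-of-order arrivals, and deliver in order once the
// gap at rcvBase is filled.
public class SrReceiver {
    private final int N;                         // window size
    private int rcvBase = 0;                     // smallest not-yet-delivered seq #
    private final Map<Integer, String> buffer = new HashMap<>();

    SrReceiver(int windowSize) { this.N = windowSize; }

    // returns the seq # to ACK, or -1 if the packet is ignored
    int onReceive(int seq, String data) {
        if (seq >= rcvBase && seq < rcvBase + N) {        // inside [rcvBase, rcvBase+N-1]
            buffer.putIfAbsent(seq, data);
            while (buffer.containsKey(rcvBase)) {         // deliver any in-order run
                deliver(buffer.remove(rcvBase));
                rcvBase++;                                // advance window
            }
            return seq;                                   // send ACK(seq)
        } else if (seq >= rcvBase - N && seq < rcvBase) { // already delivered:
            return seq;                                   // re-ACK so the sender can advance
        }
        return -1;                                        // otherwise: ignore
    }

    private void deliver(String data) { System.out.println("deliver: " + data); }
}
```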

Selective repeat in action

selective_repeat_in_sction.png

Selective repeat: dilemma

  • example:
  • seq #’s: 0, 1, 2, 3
  • window size=3

selective_repeat_dilemma.png

  • receiver sees no difference in two scenarios, duplicate data accepted as new in (b)

  • Q: what relationship between seq # size and window size to avoid problem in (b)?

  • A: for SR, the window size must be less than or equal to half the size of the sequence number space

connection-oriented transport: TCP(面向连接的传输: TCP)

  • point-to-point: one sender, one receiver
  • reliable, in-order byte stream: no “message boundaries”
  • pipelined: TCP congestion and flow control set window size
  • full duplex data:
    • bi-directional data flow in same connection
    • MSS: maximum segment size
  • connection-oriented: handshaking (exchange of control msgs) inits sender, receiver state before data exchange
  • flow controlled: sender will not overwhelm receiver

  • stream(流): no notion of message boundaries

  • Maximum Segment Size(MSS, 最大报文段长度)

  • Maximum Transmission Unit(MTU, 最大传输单元): maximum link-layer frame size

TCP segment structure(TCP报文段结构)

TCP_segment_structure.png

  • sequence numbers(序号字段): byte stream “number” of first byte in segment’s data
  • acknowledgements(确认号字段):
    • seq # of next byte expected from other side
    • cumulative ACK
  • Q: how receiver handles out-of-order segments
  • A: TCP spec doesn’t say – up to implementor

  • receive window field(接收窗口): used for flow control; indicates the number of bytes the receiver is willing to accept

TCP_seq_ack_number.png

TCP_telnet.png

  • Q: how to set TCP timeout value?
  • longer than RTT but RTT varies
  • too short: premature timeout, unnecessary retransmissions
  • too long: slow reaction to segment loss

  • Q: how to estimate RTT(估计往返时间)?

  • SampleRTT: measured time from segment transmission until ACK receipt
    • ignore retransmissions
  • SampleRTT will vary, want estimated RTT “smoother”
    • average several recent measurements, not just current SampleRTT
  • $ EstimatedRTT = (1-\alpha) \cdot EstimatedRTT + \alpha \cdot SampleRTT $
  • exponential weighted moving average(EWMA, 指数加权移动平均)
  • influence of past sample decreases exponentially fast
  • typical value: $\alpha = 0.125$
  • timeout interval: EstimatedRTT plus “safety margin”
  • large variation in EstimatedRTT -> larger safety margin

  • estimate SampleRTT deviation from EstimatedRTT:
    $$
    DevRTT = (1-\beta) \cdot DevRTT + \beta \cdot |SampleRTT - EstimatedRTT|
    $$
    (typically, $\beta = 0.25$)

$ TimeoutInterval = EstimatedRTT + 4 \cdot DevRTT $ (estimated RTT plus a “safety margin”)
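A small Java sketch of the EWMA timeout computation above, with the typical α = 0.125 and β = 0.25; the seeding of the first sample follows common practice and the SampleRTT values are made up:

```java
// Sketch of the EWMA timeout estimation above (alpha = 0.125, beta = 0.25).
public class RttEstimator {
    private double estimatedRtt = 0.0;   // seconds
    private double devRtt = 0.0;         // seconds
    private boolean first = true;

    void onSample(double sampleRtt) {
        if (first) {                      // seed the estimator with the first sample
            estimatedRtt = sampleRtt;
            devRtt = sampleRtt / 2;
            first = false;
            return;
        }
        estimatedRtt = 0.875 * estimatedRtt + 0.125 * sampleRtt;
        devRtt = 0.75 * devRtt + 0.25 * Math.abs(sampleRtt - estimatedRtt);
    }

    double timeoutInterval() {            // EstimatedRTT plus a safety margin
        return estimatedRtt + 4 * devRtt;
    }

    public static void main(String[] args) {
        RttEstimator est = new RttEstimator();
        for (double s : new double[]{0.100, 0.120, 0.090, 0.300}) {  // made-up SampleRTTs
            est.onSample(s);
            System.out.printf("TimeoutInterval = %.3f s%n", est.timeoutInterval());
        }
    }
}
```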

TCP reliable data transfer(可靠数据传输)

  • TCP creates rdt service on top of IP’s unreliable service
    • pipelined segments
    • cumulative acks
    • single retransmission timer
  • retransmissions triggered by:
    • timeout events
    • duplicate acks
  • let’s initially consider simplified TCP sender:

  • ignore duplicate acks
  • ignore flow control, congestion control

TCP sender events:

  • data received from app:
    • create segment with seq #
    • seq # is byte-stream number of first data byte in segment
    • start timer if not already running
      • think of timer as for oldest unacked segment
      • expiration interval: TimeOutInterval
  • timeout:

    • retransmit segment that caused timeout
    • restart timer
  • ack received: if ack acknowledges previously unacked segments
    • update what is known to be ACKed
    • start timer if there are still unacked segments
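A hedged Java sketch of these three sender events with a single retransmission timer; all names and the stubs are illustrative, not TCP's actual code:

```java
// Sketch of the three simplified-sender events above: one retransmission timer
// for the oldest unACKed segment, restart on timeout, retransmit only that segment.
public class SimpleTcpSender {
    private long nextSeqNum = 0;      // byte-stream number of the next new segment
    private long sendBase = 0;        // oldest unACKed byte
    private boolean timerRunning = false;

    void onDataFromApp(byte[] data) {
        sendSegment(nextSeqNum, data);            // seq # = first byte of this segment
        if (!timerRunning) startTimer();          // timer covers the oldest unACKed segment
        nextSeqNum += data.length;
    }

    void onTimeout() {
        retransmitSegmentStartingAt(sendBase);    // resend segment that caused the timeout
        startTimer();                             // restart timer
    }

    void onAck(long ackNum) {
        if (ackNum > sendBase) {                  // ACKs previously unACKed data
            sendBase = ackNum;
            if (sendBase < nextSeqNum) startTimer();  // still unACKed segments in flight
            else stopTimer();
        }
    }

    private void sendSegment(long seq, byte[] data) { /* pass segment to IP (stub) */ }
    private void retransmitSegmentStartingAt(long seq) { /* retransmit (stub) */ }
    private void startTimer() { timerRunning = true; }
    private void stopTimer()  { timerRunning = false; }
}
```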

TCP sender (simplified)

TCP_sender.png

retransmission scenarios:

TCP_retransmission_scenarios.png

TCP_retransmission_scenarios1.png

TCP ACK generation [RFC 1122, RFC 2581]

event at receiver → TCP receiver action

  • arrival of in-order segment with expected seq #. All data up to expected seq # already ACKed → delayed ACK. Wait up to 500ms for next segment. If no next segment, send ACK
  • arrival of in-order segment with expected seq #. One other segment has ACK pending → immediately send single cumulative ACK, ACKing both in-order segments
  • arrival of out-of-order segment with higher-than-expected seq #. Gap detected → immediately send duplicate ACK, indicating seq # of next expected byte
  • arrival of segment that partially or completely fills gap → immediately send ACK, provided that segment starts at lower end of gap

TCP fast retransmit(快速重传)

  • time-out period often relatively long: long delay before resending lost packet
  • detect lost segments via duplicate ACKs.
    • sender often sends many segments back-to-back
    • if segment is lost, there will likely be many duplicate ACKs.
  • if sender receives 3 dupl ACKs for same data(“triple duplicate ACKs”), resend unacked segment with smallest seq #

  • likely that unacked segment lost, so don’t wait for timeout

TCP_fast_retransmit.png
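A minimal Java sketch of the triple-duplicate-ACK rule; the counter logic is the idea above, the retransmission itself is a stub:

```java
// Sketch of the triple-duplicate-ACK rule above: count repeats of the same
// ACK number and retransmit the oldest unACKed segment on the third duplicate,
// without waiting for the timer.
public class FastRetransmit {
    private int lastAckNum = -1;
    private int dupCount = 0;

    void onAck(int ackNum) {
        if (ackNum == lastAckNum) {
            dupCount++;
            if (dupCount == 3) {                       // "triple duplicate ACK"
                retransmitSegmentStartingAt(ackNum);   // smallest unACKed seq # = ackNum
            }
        } else {                                       // new data ACKed
            lastAckNum = ackNum;
            dupCount = 0;
        }
    }

    private void retransmitSegmentStartingAt(int seq) {
        System.out.println("fast retransmit segment with seq " + seq);
    }
}
```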

TCP flow control(TCP流量控制)

receiver controls sender, so sender won’t overflow receiver’s buffer by transmitting too much, too fast

TCP_flow_control.png

  • receiver “advertises” free buffer space by including rwnd value in TCP header of receiver-to-sender segments
    • RcvBuffer size set via socket options (typical default is 4096 bytes)
    • many operating systems autoadjust RcvBuffer
  • sender limits amount of unacked (“in-flight”) data to receiver’s rwnd value
  • guarantees receive buffer will not overflow

TCP_flow_control_recvBuffer.png
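A minimal Java sketch of the sender-side flow-control rule above (in-flight data kept within the advertised rwnd); all names are illustrative:

```java
// Sketch of the flow-control rule above: the sender only transmits new data
// while the amount of unACKed ("in-flight") data stays within the receiver's
// advertised window rwnd.
public class FlowControlledSender {
    private long lastByteSent = 0;
    private long lastByteAcked = 0;
    private long rwnd = 4096;            // last receive window advertised by the receiver

    void onAck(long ackedUpTo, long advertisedWindow) {
        lastByteAcked = ackedUpTo;
        rwnd = advertisedWindow;         // rwnd rides in every receiver-to-sender segment
    }

    // how many more bytes may be sent right now without overflowing the receiver
    long sendableBytes() {
        long inFlight = lastByteSent - lastByteAcked;
        return Math.max(0, rwnd - inFlight);
    }

    void send(long nBytes) {
        long allowed = Math.min(nBytes, sendableBytes());
        lastByteSent += allowed;         // transmit 'allowed' bytes (transmission stubbed out)
    }
}
```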

Connection Management(TCP连接管理)

  • before exchanging data, sender/receiver “handshake”:
    • agree to establish connection (each knowing the other willing to establish connection)
    • agree on connection parameters

TCP_connection_management.png

  • Q: will 2-way handshake always work in network?
  • variable delays
  • retransmitted messages (e.g. req_conn(x)) due to message loss
  • message reordering
  • can’t “see” other side

2-way handshake failure scenarios:

TCP_connection_management1.png

TCP 3-way handshake(三次握手)

TCP_connection_management_3_way_handshake.png

TCP 3-way handshake: FSM

TCP_connection_management_3_way_handshake_FSM.png

TCP: closing a connection(四次挥手)

  • client, server each close their side of connection
    • send TCP segment with FIN bit = 1
  • respond to received FIN with ACK
    • on receiving FIN, ACK can be combined with own FIN
  • simultaneous FIN exchanges can be handled

TCP_connection_management_closing.png

Principles of congestion control(拥塞控制原理)

  • congestion:
  • informally: “too many sources sending too much data too fast for network to handle”
  • different from flow control!
  • manifestations:
    • lost packets (buffer overflow at routers)
    • long delays (queueing in router buffers)
  • a top-10 problem

Causes/costs of congestion: scenarios

  • two senders, two receivers
  • one router, infinite buffers
  • output link capacity: R
  • no retransmission

TCP_principle_congestion.png

TCP_principle_congestion1.png

  • one router, finite buffers
  • sender retransmission of timed-out packet
    • application-layer input = application-layer output: $\lambda_{in} = \lambda_{out}$
    • transport-layer input includes retransmissions: $\lambda_{in}' \geq \lambda_{in}$

TCP_principle_congestion2.png

  • idealization: perfect knowledge
    • sender sends only when router buffers available

TCP_principle_congestion3.png

  • Idealization: known loss – packets can be lost, dropped at router due to full buffers
    • sender only resends if packet known to be lost

TCP_principle_congestion4.png

  • Realistic: duplicates
    • packets can be lost, dropped at router due to full buffers
    • sender times out prematurely, sending two copies, both of which are delivered

TCP_principle_congestion5.png

TCP_principle_congestion6.png

  • “costs” of congestion:
    • more work (retrans) for given “goodput”
    • unneeded retransmissions: link carries multiple copies of pkt
      • decreasing goodput
  • four senders

  • multihop paths
  • timeout/retransmit

  • Q: what happens as $\lambda_{in}$ and $\lambda_{in}'$ increase?

  • A: as red $\lambda_{in}'$ increases, all arriving blue pkts at upper queue are dropped, blue throughput $\to$ 0

TCP_principle_congestion7.png

TCP_principle_congestion8.png

  • another “cost” of congestion:
    • when packet dropped, any upstream transmission capacity used for that packet was wasted(上游路由器用于转发该分组而使用的传输容量最终被浪费掉了)

TCP congestion control: additive increase multiplicative decrease(AIMD, 加性增,乘性减)

  • approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
    • additive increase: increase cwnd by 1 MSS every RTT until loss detected
    • multiplicative decrease: cut cwnd in half after loss

TCP_congestion_aimd.png

TCP_congestion_cwnd.png

  • sender limits transmission: $ LastByteSent - LastByteAcked \leq cwnd $

  • cwnd(拥塞窗口长度) is dynamic, function of perceived network congestion

  • TCP sending rate:

    • roughly: send cwnd bytes, wait RTT for ACKS, then send more bytes
    • $ rate \approx \frac{cwnd}{RTT} bytes/sec $
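A toy Java sketch of AIMD and the rate approximation above; the MSS value and the loss pattern in main are made up purely for illustration:

```java
// Sketch of AIMD and the sending-rate approximation above:
// cwnd grows by one MSS per RTT, is halved on loss, and rate ~ cwnd / RTT.
public class AimdWindow {
    static final int MSS = 1460;         // bytes, illustrative
    private double cwnd = 10 * MSS;      // congestion window, bytes

    void onRttWithoutLoss() { cwnd += MSS; }      // additive increase: +1 MSS per RTT
    void onLoss()           { cwnd = cwnd / 2; }  // multiplicative decrease: halve cwnd

    double rateBytesPerSec(double rttSeconds) {   // rate ~ cwnd / RTT
        return cwnd / rttSeconds;
    }

    public static void main(String[] args) {
        AimdWindow w = new AimdWindow();
        for (int rtt = 1; rtt <= 20; rtt++) {
            if (rtt % 8 == 0) w.onLoss(); else w.onRttWithoutLoss();  // pretend a loss every 8 RTTs
            System.out.printf("RTT %2d: cwnd = %.0f bytes, rate ~ %.0f B/s%n",
                    rtt, w.cwnd, w.rateBytesPerSec(0.1));             // sawtooth pattern
        }
    }
}
```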

TCP Slow Start(慢启动)

  • when connection begins, increase rate exponentially until first loss event:
    • initially cwnd = 1 MSS
    • double cwnd every RTT
    • done by incrementing cwnd for every ACK received
  • summary: initial rate is slow but ramps up exponentially fast

detecting, reacting to loss

  • loss indicated by timeout:
    • cwnd set to 1 MSS;
    • window then grows exponentially (as in slow start) to threshold, then grows linearly (enters congestion avoidance, 拥塞避免)
  • loss indicated by 3 duplicate ACKs: TCP RENO (enters fast recovery, 快速恢复)

    • dup ACKs indicate network capable of delivering some segments
    • cwnd is cut in half, window then grows linearly
  • TCP Tahoe always sets cwnd to 1 MSS (on timeout or 3 duplicate ACKs) (enters slow start, 慢启动)

switching from slow start to CA (Congestion Avoidance)

TCP_congestion_switching.png

  • Q: when should the exponential increase switch to linear?
  • A: when cwnd gets to 1/2 of its value before timeout.

  • Implementation:

    • variable ssthresh
    • on loss event, ssthresh is set to 1/2 of cwnd just before loss event

TCP_congestion_FSM.png
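A simplified Java sketch of the cwnd evolution in the FSM above (Reno-style, ignoring fast-recovery window inflation); the initial ssthresh is an arbitrary illustrative value:

```java
// Sketch of the cwnd evolution described above (simplified TCP Reno):
// exponential growth below ssthresh, linear growth above it, and the two
// loss reactions. cwnd is counted in MSS units for clarity.
public class RenoCwnd {
    static final double MSS = 1.0;
    double cwnd = 1 * MSS;
    double ssthresh = 64 * MSS;          // initial value is implementation-dependent

    void onAck() {
        if (cwnd < ssthresh) cwnd += MSS;              // slow start: +1 MSS per ACK (doubles each RTT)
        else                 cwnd += MSS * MSS / cwnd; // congestion avoidance: ~+1 MSS per RTT
    }

    void onTripleDupAck() {              // Reno: halve and continue growing linearly
        ssthresh = cwnd / 2;
        cwnd = ssthresh;
    }

    void onTimeout() {                   // Reno and Tahoe: back to slow start
        ssthresh = cwnd / 2;
        cwnd = 1 * MSS;
    }
}
```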

TCP throughput(TCP吞吐量)

  • avg. TCP thruput as function of window size, RTT?
    • ignore slow start, assume always data to send
  • W: window size (measured in bytes) where loss occurs

    • avg. window size (# in-flight bytes) is 3/4 W
    • avg. thruput is 3/4W per RTT
    • $ \text{avg TCP throughput} = \frac{3}{4}\,\frac{W}{RTT}\ \text{bytes/sec} $

TCP Futures: TCP over “long, fat pipes”(经高带宽路径的TCP)

  • example: 1500 byte segments, 100ms RTT, want 10 Gbps throughput
  • requires W = 83,333 in-flight segments
  • throughput in terms of segment loss probability, L [Mathis 1997]:
  • $ \text{TCP throughput} = \frac{1.22 \cdot MSS}{RTT \cdot \sqrt{L}} $
  • to achieve 10 Gbps throughput, need a loss rate of $L = 2 \cdot 10^{-10}$ – a very small loss rate
  • new versions of TCP for high-speed
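A worked version of the two numbers above, under the stated assumptions (1500-byte segments, 100 ms RTT, 10 Gbps target):

$$
W \approx \text{throughput} \cdot \frac{RTT}{MSS} = \frac{10^{10}\ \text{bits/sec} \times 0.1\ \text{sec}}{1500 \times 8\ \text{bits}} \approx 83{,}333\ \text{segments}
$$

$$
\sqrt{L} = \frac{1.22 \cdot MSS}{RTT \cdot \text{throughput}} = \frac{1.22 \times 12{,}000\ \text{bits}}{0.1\ \text{sec} \times 10^{10}\ \text{bits/sec}} \approx 1.46 \times 10^{-5}
\quad\Rightarrow\quad L \approx 2 \times 10^{-10}
$$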

TCP Fairness(TCP公平性)

  • fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K

  • Why is TCP fair

  • two competing sessions:
    • additive increase gives slope of 1, as throughput increases
    • multiplicative decrease decreases throughput proportionally

TCP_congestion_fair.png

Fairness and UDP

  • multimedia apps often do not use TCP
    • do not want rate throttled by congestion control
  • instead use UDP:
    • send audio/video at constant rate, tolerate packet loss

Fairness, parallel TCP connections

  • application can open multiple parallel connections between two hosts
  • web browsers do this
  • e.g., link of rate R with 9 existing connections:
    • new app asks for 1 TCP, gets rate R/10
    • new app asks for 11 TCPs, gets R/2

Explicit Congestion Notification (ECN)

  • network-assisted congestion control:
    • two bits in IP header (ToS field) marked by network router to indicate congestion
    • congestion indication carried to receiving host
    • receiver (seeing congestion indication in IP datagram), sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion